TP5: Decision trees & random forests¶

The aim of this tutorial is to get familiar with the use of decision trees and their generalizations on simple examples using scikit-learn tools.

Completing your installation first¶

You will need to install packages python-graphviz first. If needed, uncomment the conda command below:

In [ ]:
# If needed, uncomment the line below:
# pip install graphviz
In [ ]:
from pylab import *
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris, load_wine
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import mean_absolute_error, accuracy_score
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import graphviz 
import pandas as pd
import random

rng_seed = np.random.seed(0)

import warnings
warnings.filterwarnings("ignore")

The data for this tutorial is famous. Called, the iris dataset, it contains four variables measuring various parts of iris flowers of three related species, and then a fourth variable with the species name. The reason it is so famous in machine learning and statistics communities is because the data requires very little preprocessing (i.e. no missing values, all features are floating numbers, etc.).

In [ ]:
iris = load_iris()

Step 1: explore the data set¶

  1. What is the structure of the object iris ?

  2. Plot this dataset in a well chosen set of representations to explore the data.

iris is from the Bunch class of Sklearn. It is a kind of dictionary where elements can also be called as attributes. Iris being a dataset, it has this list of attributes:

  • data : the data matrix
  • target : The labels
  • feature_names : The name of the data attributes
  • target_names : The names of the target classes (flower names)
  • DESCR : The datset description
  • filename : The path to the location of the data

--> Given that iris is a dictionary, we use the pandas library to tranform it into a DataFrame and explore the data in a well chosen set.

Using pandas to manipulate the data¶

Pandas is great to manipulate data in a Microsoft Excel like way.

In [ ]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df.head()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
In [ ]:
# Add a new column with the species names, this is what we are going to try to predict
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df.head()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa

Data vizualisation¶

In [ ]:
import seaborn as sns
In [ ]:
sns.pairplot(df, hue="species")
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f8f12708d60>

Thanks to this visualization, we can see that it will be possible to separate the different species using the available attributes. However, we must use all of them because it is impossible to separate them using only 2 attributes.

In [ ]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[ ]:
  sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
sepal length (cm) 1.000000 -0.117570 0.871754 0.817941
sepal width (cm) -0.117570 1.000000 -0.428440 -0.366126
petal length (cm) 0.871754 -0.428440 1.000000 0.962865
petal width (cm) 0.817941 -0.366126 0.962865 1.000000

The attributes appear to be highly correlated. We can try later to use only attributes with little correlation, we should not get a drop in results.

Step 2: create training and test sets¶

Create a new column that for each row, generates a random number between 0 and 1, and if that value is less than or equal to .75, then sets the value of that cell as True and false otherwise. This is a quick and dirty way of randomly assigning some rows to be used as the training data and some as the test data.

In [ ]:
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df.head()
Out[ ]:
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) species is_train
0 5.1 3.5 1.4 0.2 setosa True
1 4.9 3.0 1.4 0.2 setosa True
2 4.7 3.2 1.3 0.2 setosa True
3 4.6 3.1 1.5 0.2 setosa True
4 5.0 3.6 1.4 0.2 setosa True
In [ ]:
# Create two new dataframes, one with the training rows, one with the test rows
train, test = df[df['is_train']==True], df[df['is_train']==False]
In [ ]:
# Show the number of observations for the test and training dataframes
print('Number of observations in the training data:', len(train))
print('Number of observations in the test data:',len(test))
Number of observations in the training data: 118
Number of observations in the test data: 32
In [ ]:
# Create a list of the feature column's names
features = df.columns[:4].tolist()

# View features
features
Out[ ]:
['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
In [ ]:
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]

Step 3: Decision trees for the iris dataset¶

The method tree.DecisionTreeClassifier() from scikit-learn builds decision trees objects as follows:

In [ ]:
clf = tree.DecisionTreeClassifier(random_state=rng_seed)
clf = clf.fit(train[features], y)

# Using the whole dataset you may use directly:
#clf = clf.fit(iris.data, iris.target)

The export_graphviz exporter supports a variety of aesthetic options, including coloring nodes by their class (or value for regression) and using explicit variable and class names if desired. Jupyter notebooks also render these plots inline automatically:

In [ ]:
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
Out[ ]:
Tree 0 petal width (cm) ≤ 0.8 gini = 0.665 samples = 118 value = [37, 43, 38] class = versicolor 1 gini = 0.0 samples = 37 value = [37, 0, 0] class = setosa 0->1 True 2 petal length (cm) ≤ 4.75 gini = 0.498 samples = 81 value = [0, 43, 38] class = versicolor 0->2 False 3 petal width (cm) ≤ 1.65 gini = 0.048 samples = 41 value = [0, 40, 1] class = versicolor 2->3 6 petal width (cm) ≤ 1.75 gini = 0.139 samples = 40 value = [0, 3, 37] class = virginica 2->6 4 gini = 0.0 samples = 40 value = [0, 40, 0] class = versicolor 3->4 5 gini = 0.0 samples = 1 value = [0, 0, 1] class = virginica 3->5 7 sepal width (cm) ≤ 2.65 gini = 0.49 samples = 7 value = [0, 3, 4] class = virginica 6->7 14 gini = 0.0 samples = 33 value = [0, 0, 33] class = virginica 6->14 8 gini = 0.0 samples = 2 value = [0, 0, 2] class = virginica 7->8 9 petal length (cm) ≤ 5.05 gini = 0.48 samples = 5 value = [0, 3, 2] class = versicolor 7->9 10 gini = 0.0 samples = 2 value = [0, 2, 0] class = versicolor 9->10 11 sepal length (cm) ≤ 6.15 gini = 0.444 samples = 3 value = [0, 1, 2] class = virginica 9->11 12 gini = 0.0 samples = 1 value = [0, 1, 0] class = versicolor 11->12 13 gini = 0.0 samples = 2 value = [0, 0, 2] class = virginica 11->13

We can also export the tree in Graphviz format and savethe resulting graph in an output file iris.pdf:

In [ ]:
dot_data = tree.export_graphviz(clf, out_file=None) 
graph = graphviz.Source(dot_data) 
graph.render("iris") 
Out[ ]:
'iris.pdf'

After being fitted, the model can then be used to predict the class of samples:

In [ ]:
class_pred = clf.predict(iris.data[:1, :])
iris.target_names[class_pred[0]]
Out[ ]:
'setosa'

Exercise 1¶

  1. Train the decision tree on the iris dataset and explain how one should read blocks in graphviz representation of the tree.

  2. Plot the regions of decision with the points of the training set superimposed.

*Indication: you may find the function plt.contourf useful.

Réponse 1 : Pour entraîner l'arbre de décision sur les données du dataset Iris et les plots en utilisant Graphviz, on reprend les codes précédents :

In [ ]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
clf = tree.DecisionTreeClassifier(random_state=rng_seed)
clf = clf.fit(train[features], y)
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)
y_pred_train = clf.predict(train[features])

Explication : Un arbre de décision se compose de noeuds (carrés intermédiaires) et de feuilles (carrés en bout de chaîne). Chaque noeud comporte l'information relative à une décision. Si toutes les informations du noeud sont respectées, on passe au noeud enfant suivant la flèche True, sinon on passe au noeud enfant suivant la flèche False. Cela permet en bout de course d'arriver à une feuille qui comporte l'information de la décision prise par l'arbre au final.

Réponse 2 : On dessine les frontières de décision (entre chaque paire d'attributs) relative à l'arbre de décision entraîné précédemment. Le code provient de Sklearn.

In [ ]:
from sklearn.inspection import DecisionBoundaryDisplay
from itertools import combinations

attr = list(combinations(df.columns[:-2], 2))
attr = list(map(list, attr)) # List of all pairs of attributes in the iris dataset

for i, pair in enumerate(attr):
    X = df[pair].to_numpy()
    y = iris.target
    clf_attr = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X,y)
    
    ax = plt.subplot(2,3,i+1)
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)
    # Decision Boundaries
    DecisionBoundaryDisplay.from_estimator(clf_attr, X, cmap = "brg", response_method="predict", ax=ax,
                                           xlabel=pair[0],
                                           ylabel=pair[1])    
    # Data Points
    for t, color in zip(range(3), "brg"):
        idx = np.where(t == y)
        plt.scatter(X[idx, 0],
                    X[idx, 1],
                    c = color,
                    cmap = "brg",
                    edgecolors="black",
                    s = 20)
        
plt.suptitle("Ensemble de frontières de décisions 2 à 2 données par un DecisionTree");

Comme nous pouvons le voir, les frontières de décisions ne sont pas linéaire du fait de l'utilisation d'un arbre de décision.


Exercise 2¶

  1. Build 2 different trees based on a sepal features (sepal lengths, sepal widths) vs petal features (petal lengths, petal widths) only: which features are the most discriminant?

  2. Compare performances with those obtained using all features.

  3. Try the same as above using the various splitting criterion available, Gini's index, classification error or cross-entropy. Comment on your results.

Réponse 1 : On crée deux arbres de décision basés sur les attributs liés aux sépales et ceux liés aux pétales.

In [ ]:
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
In [ ]:
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]

X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X_sepal, y)

X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X_petal, y)
In [ ]:
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
ACC (sepal) : 0.717948717948718
ACC (petal) : 0.9743589743589743
In [ ]:
from sklearn.metrics import confusion_matrix
In [ ]:
confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,2, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")

ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())

ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())
Out[ ]:
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]

On a estimé les performances des deux arbres de décision suivant les metrics d'accuracy et par la matrice de confusion.

  • Accuracy : Les prédictions sont plus précises par l'arbre de décision sur les pétales (environ 90%) même si un tel résultat doit être pris en considérant que le jeu de données a assez peu d'échantillons
  • Confusion matrix : Les résultats confirment l'accuracy avec des données bien prédites plus importantes lorsque l'on ne garde que les attributs liés aux pétales. On constate que les valeurs mal prédites lorsque l'on ne garde que les attributs liés aux sépales se répartissent sur la mauvaise prédiction entre Versicolour et Virginica

Réponse 2 : Comparons les performances avec les performances de l'arbre de décision qui garde tous les attributs

In [ ]:
X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(random_state=rng_seed).fit(X, y)

y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
ax = plt.subplot(1,1,1)
sns.heatmap(confusion_matrix_all, ax=ax, annot=True, cmap="YlGnBu")

ax.set_title("Confusion Matrix for all")
ax.set_xlabel("Predictions values")
ax.set_ylabel("Actual values")
ax.xaxis.set_ticklabels(iris.target_names.tolist())
ax.yaxis.set_ticklabels(iris.target_names.tolist())
ACC (all) : 0.9230769230769231
Out[ ]:
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]

On constate qu'en utilisant l'ensemble du jeu de données les prédictions sont excellentes comme pour l'abre de décision liés simplement aux pétales

  • ACC : Environ 97,1% (par rapport à 65,8% et 97,1%)
  • Confusion Matrix : 1 erreur entre Versicolour et Virginica

Réponse 3 : Changeons désormais le critère de décision pour comparer les performances

Arbre de décision avec criterion = entropy¶

In [ ]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
In [ ]:
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]

X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X_sepal, y)

X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X_petal, y)

X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed).fit(X, y)
In [ ]:
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")

confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")

ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())

ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())

ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7631578947368421
ACC (petal) : 0.9210526315789473
ACC (all) : 0.9210526315789473
Out[ ]:
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]

Arbre de décision avec criterion = log_loss¶

In [ ]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
In [ ]:
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]

X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X_sepal, y)

X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X_petal, y)

X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed).fit(X, y)
In [ ]:
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")

confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")

ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())

ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())

ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7142857142857143
ACC (petal) : 0.9428571428571428
ACC (all) : 0.9428571428571428
Out[ ]:
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]
In [ ]:
features_sepal = ["sepal length (cm)", "sepal width (cm)"]
features_petal = ["petal length (cm)", "petal width (cm)"]

X_sepal = train[features_sepal]
clf_sepal = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X_sepal, y)

X_petal = train[features_petal]
clf_petal = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X_petal, y)

X = train.iloc[:, 0:4]
clf_all = tree.DecisionTreeClassifier(criterion="gini", random_state=rng_seed).fit(X, y)
In [ ]:
y_pred_sepal = clf_sepal.predict(test[features_sepal])
y_pred_petal = clf_petal.predict(test[features_petal])
y_pred_all = clf_all.predict(test.iloc[:, 0:4])
print(f"ACC (sepal) : {accuracy_score(y_true, y_pred_sepal)}")
print(f"ACC (petal) : {accuracy_score(y_true, y_pred_petal)}")
print(f"ACC (all) : {accuracy_score(y_true, y_pred_all)}")

confusion_matrix_sepal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_sepal, iris.target_names), labels=iris.target_names)
confusion_matrix_petal = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_petal, iris.target_names), labels=iris.target_names)
confusion_matrix_all = confusion_matrix(np.choose(y_true, iris.target_names), np.choose(y_pred_all, iris.target_names), labels=iris.target_names)
fig, ax = plt.subplots(1,3, figsize=(12,8))
sns.heatmap(confusion_matrix_sepal, ax=ax[0], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_petal, ax=ax[1], annot=True, cmap="YlGnBu")
sns.heatmap(confusion_matrix_all, ax=ax[2], annot=True, cmap="YlGnBu")

ax[0].set_title("Confusion Matrix for Sepal")
ax[0].set_xlabel("Predictions values")
ax[0].set_ylabel("Actual values")
ax[0].xaxis.set_ticklabels(iris.target_names.tolist())
ax[0].yaxis.set_ticklabels(iris.target_names.tolist())

ax[1].set_title("Confusion Matrix for Petal")
ax[1].set_xlabel("Predictions values")
ax[1].set_ylabel("Actual values")
ax[1].xaxis.set_ticklabels(iris.target_names.tolist())
ax[1].yaxis.set_ticklabels(iris.target_names.tolist())

ax[2].set_title("Confusion Matrix for all")
ax[2].set_xlabel("Predictions values")
ax[2].set_ylabel("Actual values")
ax[2].xaxis.set_ticklabels(iris.target_names.tolist())
ax[2].yaxis.set_ticklabels(iris.target_names.tolist())
ACC (sepal) : 0.7631578947368421
ACC (petal) : 0.9210526315789473
ACC (all) : 0.9473684210526315
Out[ ]:
[Text(0, 0.5, 'setosa'), Text(0, 1.5, 'versicolor'), Text(0, 2.5, 'virginica')]

On met tout sous forme de tableau pour résumer :

In [ ]:
from IPython import display
display.Image("Summary_table.png")
Out[ ]:

On constate presque toujours une précision similaire entre le jeu de données réduit aux données de sépales et le jeu de données total. Avec plus de données et un problème plus complexe nous pourrions peut être voir les avantages et inconvénients de chaque critère de décision, mais ce n'est pas le cas ici.

Going further ahead (not mandatory)¶

Try the same approach adapted to another toy dataset from scikit-learn described at: http://scikit-learn.org/stable/datasets/index.html

Play with another dataset available at: http://archive.ics.uci.edu/ml/datasets.html

In [ ]:
wine= load_wine(as_frame=True)
data_full = wine.data.copy()
data_full['classes'] = wine.target.copy()
X = wine.data.copy()
y = wine.target.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=rng_seed)
In [ ]:
sns.pairplot(data_full, hue="classes")
Out[ ]:
<seaborn.axisgrid.PairGrid at 0x7f8e358b7d00>

Il est cette fois-ci plus difficile de voir une classification triviale à partir des attributs mis 2 à 2. Cependant, il y a quelques attributs qui sparent bien les différentes classes (alcohol, macid_alic, color_intensity,...), ce sont ces attributs qu'il faut prioriser.

In [ ]:
corr = X.corr()
corr.style.background_gradient(cmap='coolwarm')
Out[ ]:
  alcohol malic_acid ash alcalinity_of_ash magnesium total_phenols flavanoids nonflavanoid_phenols proanthocyanins color_intensity hue od280/od315_of_diluted_wines proline
alcohol 1.000000 0.094397 0.211545 -0.310235 0.270798 0.289101 0.236815 -0.155929 0.136698 0.546364 -0.071747 0.072343 0.643720
malic_acid 0.094397 1.000000 0.164045 0.288500 -0.054575 -0.335167 -0.411007 0.292977 -0.220746 0.248985 -0.561296 -0.368710 -0.192011
ash 0.211545 0.164045 1.000000 0.443367 0.286587 0.128980 0.115077 0.186230 0.009652 0.258887 -0.074667 0.003911 0.223626
alcalinity_of_ash -0.310235 0.288500 0.443367 1.000000 -0.083333 -0.321113 -0.351370 0.361922 -0.197327 0.018732 -0.273955 -0.276769 -0.440597
magnesium 0.270798 -0.054575 0.286587 -0.083333 1.000000 0.214401 0.195784 -0.256294 0.236441 0.199950 0.055398 0.066004 0.393351
total_phenols 0.289101 -0.335167 0.128980 -0.321113 0.214401 1.000000 0.864564 -0.449935 0.612413 -0.055136 0.433681 0.699949 0.498115
flavanoids 0.236815 -0.411007 0.115077 -0.351370 0.195784 0.864564 1.000000 -0.537900 0.652692 -0.172379 0.543479 0.787194 0.494193
nonflavanoid_phenols -0.155929 0.292977 0.186230 0.361922 -0.256294 -0.449935 -0.537900 1.000000 -0.365845 0.139057 -0.262640 -0.503270 -0.311385
proanthocyanins 0.136698 -0.220746 0.009652 -0.197327 0.236441 0.612413 0.652692 -0.365845 1.000000 -0.025250 0.295544 0.519067 0.330417
color_intensity 0.546364 0.248985 0.258887 0.018732 0.199950 -0.055136 -0.172379 0.139057 -0.025250 1.000000 -0.521813 -0.428815 0.316100
hue -0.071747 -0.561296 -0.074667 -0.273955 0.055398 0.433681 0.543479 -0.262640 0.295544 -0.521813 1.000000 0.565468 0.236183
od280/od315_of_diluted_wines 0.072343 -0.368710 0.003911 -0.276769 0.066004 0.699949 0.787194 -0.503270 0.519067 -0.428815 0.565468 1.000000 0.312761
proline 0.643720 -0.192011 0.223626 -0.440597 0.393351 0.498115 0.494193 -0.311385 0.330417 0.316100 0.236183 0.312761 1.000000

Les données sont beaucoup moins corrélés que pour le dataset d'iris, nous pouvons utiliser le dataset dans son entiereté.

In [ ]:
models = [tree.DecisionTreeClassifier(random_state=rng_seed),
          tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed),
          tree.DecisionTreeClassifier(criterion="log_loss", random_state=rng_seed),
          RandomForestClassifier(n_estimators=30, max_depth=4, random_state=rng_seed)]
models_name = ["Decision Tree Gini", "Decision Tree Entropy", "Decision Tree Logloss", "Random Forest"]
colors = ["orange", "red", "blue", "green"]
In [ ]:
acc = []
fig, ax = plt.subplots(2,2, figsize=(16, 8))
for i, model in enumerate(models):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc.append(accuracy_score(y_test, y_pred))
    cm = confusion_matrix(np.choose(y_test, iris.target_names), np.choose(y_pred, iris.target_names), labels=iris.target_names)
    sns.heatmap(cm, ax=ax[i//2][i%2], annot=True, cmap="YlGnBu")
    ax[i//2][i%2].set_title(f"CM for {models_name[i]}")
    ax[i//2][i%2].xaxis.set_ticklabels(iris.target_names.tolist())
    ax[i//2][i%2].yaxis.set_ticklabels(iris.target_names.tolist())
In [ ]:
fig, ax = plt.subplots(1, 1, figsize=(16, 8))
for i, model in enumerate(models):
    plt.bar(x=np.arange(4)[i], height=acc[i], label=models_name[i])
    plt.title("Accuracy score for each model")
    plt.ylabel("ACC")
    plt.legend()

Interprétation : Encore une fois, on remarque que l'utilisation d'une Random Forest est plus efficace de part la réduction de variance dû à l'utilisation de plusieurs arbres de décision. Les méthodes Gini, Entropy et Log_loss sont peu déterminantes au vu de la performance bien plus élevée de la Random Forest. Il est aussi important de constater que l'on obtient déjà plus de 80% d'accuracy avec des modèles aussi simples que des arbres de décisions ce qui invite tout lecteur à ne pas négliger leur utilisation à l'abord d'un problème de classification ou de régression supervisée.

Step 4: Random forests¶

Go to

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

for a documentation about the RandomForestClassifier provided by scikit-learn.

Since target values must be integers, we first need to transform labels into numbers as below.

In [ ]:
# train['species'] contains the actual species names. Before we can use it,
# we need to convert each species name into a digit. So, in this case there
# are three species, which have been coded as 0, 1, or 2.
y = pd.factorize(train['species'])[0]

# View target
y
Out[ ]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])
In [ ]:
# Create a random forest Classifier. By convention, clf means 'Classifier'
rf = RandomForestClassifier(n_jobs=2, random_state=0)

# Train the Classifier to take the training features and learn how they relate
# to the training y (the species)
rf.fit(train[features], y)
Out[ ]:
RandomForestClassifier(n_jobs=2, random_state=0)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(n_jobs=2, random_state=0)

Make predictions and create actual english names for the plants for each predicted plant class:

In [ ]:
preds = rf.predict(test[features])
preds_names = pd.Categorical.from_codes(preds, iris.target_names)
preds_names
Out[ ]:
['setosa', 'setosa', 'setosa', 'setosa', 'setosa', ..., 'virginica', 'virginica', 'virginica', 'virginica', 'virginica']
Length: 38
Categories (3, object): ['setosa', 'versicolor', 'virginica']

Create a confusion matrix¶

In [ ]:
cm = confusion_matrix(test['species'], preds_names, labels=iris.target_names)
ax = plt.subplot(1,1,1)
sns.heatmap(cm, ax=ax, annot=True, cmap="YlGnBu")
ax.xaxis.set_ticklabels(iris.target_names.tolist())
ax.yaxis.set_ticklabels(iris.target_names.tolist())
plt.title('Confusion matrix Random Forest')
plt.show()

Feature selection using random forests byproducts¶

One of the interesting use cases for random forest is feature selection. One of the byproducts of trying lots of decision tree variations is that you can examine which variables are working best/worst in each tree.

When a certain tree uses one variable and another doesn't, you can compare the value lost or gained from the inclusion/exclusion of that variable. The good random forest implementations are going to do that for you, so all you need to do is know which method or variable to look at.

View feature importance¶

While we don't get regression coefficients like with ordinary least squares (OLS), we do get a score telling us how important each feature was in classifying. This is one of the most powerful parts of random forests, because we can clearly see that petal attributes was more important in classification than sepal attributes.

In [ ]:
# View a list of the features and their importance scores
list(zip(rf.feature_names_in_, rf.feature_importances_))
Out[ ]:
[('sepal length (cm)', 0.07878326083724368),
 ('sepal width (cm)', 0.0268273291594686),
 ('petal length (cm)', 0.40142608683728426),
 ('petal width (cm)', 0.49296332316600344)]

Exercise 3¶

  1. Comment on the feature importances with respect to your previous observations on decision trees above.

  2. Extract and visualize 5 trees belonging to the random forest using the attribute estimators_ of the trained random forest classifier. Compare them. Note that you may code a loop on extracted trees.

  3. Study the influence of parameters like max_depth, min_samples_leaf and min_samples_split. Try to optimize them and explain your approach and choices.

  4. How is estimated the prediction error of a random forest ?

Indication: have a look at parameter oob_score. What are out-of-bag samples ?

  1. What should you do when classes are not balanced in the dataset ? (that is when there are much more examples of one class than another)

Réponse 1 : A la lumière des résultats précédents, on se rend compte que les attributs liés aux pétales sont bien plus importants dans la classfication que les attributs liés aux sépales. Cela éclaire donc bien les résultats précédents qui montraient qu'un arbre de décision simplement sur les attributs liés aux pétales est déjà très efficace.

Réponse 2 : On extrait puis on enregistre dans rf_results 5 arbres de décision de la Random Forest.

In [ ]:
random.seed(10)
trees = random.sample(rf.estimators_, k=5)
for i, one_tree in enumerate(trees):
    dot_data = tree.export_graphviz(one_tree, out_file=None, 
                         feature_names=iris.feature_names,  
                         class_names=iris.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
    graph = graphviz.Source(dot_data)
    graph.render(f"rf_results/tree_{i}")
    y_pred = one_tree.predict(test.iloc[:, 0:4])
    print(f"ACC for tree_{i} = {accuracy_score(y_true, y_pred)}")
ACC for tree_0 = 0.9473684210526315
ACC for tree_1 = 0.9210526315789473
ACC for tree_2 = 0.9473684210526315
ACC for tree_3 = 0.9473684210526315
ACC for tree_4 = 0.9210526315789473

En particulier, calculons une forme de score pour chaque arbre sous la forme d'une somme noramlisée des scores d'importances calculés précédemment :

  • Tree_0 : $\frac{2*0,46 + 3*0,42 + 2*0,08 + 2*0,035}{9} = \frac{2,41}{9} = 0,268$
  • Tree_1 : $\frac{1*0,46 + 4*0,42 + 0*0,08 + 0*0,035}{5} = \frac{2,14}{5} = 0,428$
  • Tree_2 : $\frac{3*0,46 + 1*0,42 + 0*0,08 + 1*0,035}{5} = \frac{1,835}{5} = 0,367$
  • Tree_3 : $\frac{3*0,46 + 1*0,42 + 2*0,08 + 0*0,035}{6} = \frac{1,96}{6} = 0,327$
  • Tree_4 : $\frac{0*0,46 + 3*0,42 + 0*0,08 + 1*0,035}{4} = \frac{1,295}{4} = 0,324$

Donc en termes d'"importance", les deux arbres ayant un score d'importance les plus faibles sont bien les arbres 0 et 4 qui sont ceux ayant une accuracy plus faible.

Réponse 3 : Jouons avec les hyperparamètres.

In [ ]:
df = pd.DataFrame(iris.data, columns=iris.feature_names)

df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
train, test = df[df['is_train']==True], df[df['is_train']==False]
y = pd.factorize(train['species'])[0]
y_true = pd.factorize(test['species'])[0]
In [ ]:
DEPTH = np.linspace(2,8,7).astype(int)
abs, ord = [], []
for max_depth in DEPTH:
    rf = RandomForestClassifier(n_jobs=2, max_depth=max_depth, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    abs.append(max_depth)
    ord.append(accuracy_score(y_true, y_pred))
plt.plot(abs, ord)
plt.title("Evolution de la précision selon la profondeur maximale")
plt.xlabel("Max_depth")
plt.ylabel("Accuracy score")
Out[ ]:
Text(0, 0.5, 'Accuracy score')
In [ ]:
LEAF = np.linspace(1,20,20).astype(int)
abs, ord = [], []
for min_samples_leaf in LEAF:
    rf = RandomForestClassifier(n_jobs=2, min_samples_leaf=min_samples_leaf, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    abs.append(min_samples_leaf)
    ord.append(accuracy_score(y_true, y_pred))
plt.plot(abs, ord)
plt.title("Evolution de la précision selon min_samples_leaf")
plt.xlabel("min_samples_leaf")
plt.ylabel("Accuracy score")
Out[ ]:
Text(0, 0.5, 'Accuracy score')
In [ ]:
SPLIT = np.linspace(2,40,39).astype(int)
abs, ord = [], []
for min_samples_split in SPLIT:
    rf = RandomForestClassifier(n_jobs=2, min_samples_split=min_samples_split, random_state=rng_seed)
    rf.fit(train[features], y)
    y_pred = rf.predict(test[features])
    abs.append(min_samples_split)
    ord.append(accuracy_score(y_true, y_pred))
plt.plot(abs, ord)
plt.title("Evolution de la précision selon min_samples_split")
plt.xlabel("min_samples_leaf")
plt.ylabel("Accuracy score")
Out[ ]:
Text(0, 0.5, 'Accuracy score')

Après avoir identifié l'effet de chaque hyperparamètre individuellement, on regarde l'effet que chaque combinaison a de manière combinée. Cela prend plus de temps (environ 5min) algorithmiquement parlant mais est censé donner un résultat optimal.

In [ ]:
from tqdm import tqdm

DEPTH = np.linspace(2,8,7).astype(int)
LEAF = np.linspace(1,8,8).astype(int)
SPLIT = np.linspace(2,40,40).astype(int)
scores = [0]
for i, max_depth in tqdm(enumerate(DEPTH), total=len(DEPTH)) :
    for j, min_samples_leaf in enumerate(LEAF):
        for k, min_samples_split in enumerate(SPLIT):
            rf = RandomForestClassifier(n_jobs=2, max_depth=max_depth, 
                                min_samples_leaf=min_samples_leaf,
                                min_samples_split=min_samples_split, random_state=rng_seed)
            rf.fit(train[features], y)
            y_pred = rf.predict(test[features])
            if accuracy_score(y_true, y_pred) > max(scores):
                best_combination = (max_depth, min_samples_leaf, min_samples_split)
                print(best_combination)
                print(accuracy_score(y_true, y_pred))
            scores.append(accuracy_score(y_true, y_pred))
  0%|          | 0/7 [00:00<?, ?it/s]
(2, 1, 2)
0.9705882352941176
 43%|████▎     | 3/7 [02:46<03:39, 54.95s/it]
(5, 8, 35)
1.0
100%|██████████| 7/7 [06:19<00:00, 54.19s/it]
In [ ]:
print("La meilleure combinaison est : ")
print(f"- Max_depth = {best_combination[0]}")
print(f"- Min_samples_leaf = {best_combination[1]}")
print(f"- Min_samples_split = {best_combination[2]}")
La meilleure combinaison est : 
- Max_depth = 5
- Min_samples_leaf = 8
- Min_samples_split = 35

Nous arrivons à avoir une précision de 1 avec le dataset de test généré.

Cette combinaison semble être cohérente avec l'effet individuel des hyperparamètres sur les prédictions dessiné précédemment. Si la profondeur optimale (au regard des autres hyperparamètres) semble être faible, il faut se rappeler qu'une forêt aléatoire tire sa force du principe de comités experts où elle peut obtenir de très bon résultat à partir d'un vote selon un ensemble d'arbres très simples ! Cependant, il reste difficile d'en tirer de bonnes conclusions étant donné la simplicité du dataset et de la classification demandée.

La technique que nous avons utiisé est très couteuse et n'est pas possible en un temps raisonnable pour un très grand dataset, le moyen le plus simple reste d'optimiser les paramètres un par un en commencant par min_samples_leaf et min_samples_split, qui, d'après le papier de recherche "An empirical study on hyperparameter tuning of deciion trees" de Rafael Gomes Mantovani, Tomas Horvath, Ricardo Cerri, Sylvio Barbon Junior, Joaquin Vanschoren, André Carlo Ponce de Leon Ferrreira de Carvalho, indique que ce sont ces hyperparamètres qui influent le plus.

Réponse 4 : Pour estimer l'erreur de prédiction on utilise le "out-of-bag error"(OOB error). Le RandomForestClassifier est entraîné en utilisant l'agrégation bootstrap, où chaque nouvel arbre est ajusté à partir d'un échantillon bootstrap des observations d'entraînement $z_i = (x_i, y_i)$. L'erreur out-of-bag (OOB) est l'erreur moyenne pour chaque $z_i$ calculée en utilisant les prédictions des arbres qui ne contiennent pas $z_i$ dans leur échantillon bootstrap respectif (OOB sample). Cela permet d'ajuster et de valider le RandomForestClassifier tout en le formant.

Réponse 5 : Lorsque le dataset présente des classes en quantité non équivalentes (comme ça peut être le cas dans la détection de fraudes par exemple) l'algorithme reçoit beaucoup plus d'exemples d'une classe, ce qui l'incite à privilégier cette classe particulière. Il n'apprend pas ce qui rend l'autre classe "différente" et ne parvient pas à comprendre les modèles sous-jacents qui nous permettent de distinguer les classes.

Pour y pallier, on peut :

  • Recueillir plus de données sur la classe minoritaire
  • Faire du datasampling :
    • Undersampling : On choisit un sous-ensemble du dataset permettant d'avoir les classes représentées de manière équivalentes
    • Oversampling : On crée des observations synthétiques de la classe minoritaire avec des algorithmes comme les VAE (Variational Encoders), SMOTE (Synthetic Minority Over-Sampling Technique), MSMOTE (Modified Synthetic Minority Over-Sampling Technique)
  • Utiliser un modèle résistant à la différence de proportions dans les classes
  • On peut changer d'approche et regarder un problème comme celui-ci sous la force de la détection d'anomalies pour lequel on peut utiliser un SVM à une classe
  • Utiliser des modèles pénalisables

Step 5: a small example of regression using random forests¶

Random forest is capable of learning without carefully crafted data transformations. Take the the $f(x) = \sin(x)$ function for example.

Create some fake data and add a little noise.

In [ ]:
x = np.random.uniform(-2.5, 2.5, 1000)
y = np.sin(x) + np.random.normal(0, .1, 1000)

plt.plot(x,y,'ko',markersize=1,label='data')
plt.plot(np.arange(-2.5,2.5,0.1),np.sin(np.arange(-2.5,2.5,0.1)),'r-',label='ref')
plt.show()

If we try and build a basic linear model to predict y using x we end up with a straight line that sort of bisects the sin(x) function. Whereas if we use a random forest, it does a much better job of approximating the sin(x) curve and we get something that looks much more like the true function.

Based on this example, we will illustrate how the random forest isn't bound by linear constraints.

Exercise 4¶

  1. Apply random forests on this dataset for regression and compare performances with ordinary least squares regression.

Note that ordinay least square regression is available thanks to: from sklearn.linear_model import LinearRegression

  1. Comment on your results.

Indications:¶

You may use half of points for training and others to test predictions. Then you will have an idea of how far the random forest predictor fits the sinus curve.

To this aim, you will need to use the model RandomForestRegressor. Be careful that when only 1 feature x is used as an input, you will need to reshape it by x.reshape(-1,1) when using methods fit and predict.

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.5, random_state=rng_seed)

Indication¶

One clever way to compare models when using scikit-learnis to make a loop on models as follows:

In [ ]:
models = [LinearRegression(fit_intercept=True),
          tree.DecisionTreeRegressor(random_state=rng_seed),
          RandomForestRegressor(n_estimators=30, max_depth=4, random_state=rng_seed)]
models_name = ["Linear Regression", "Decision Tree", "Random Forest"]
colors = ["orange", "blue", "green"]
In [ ]:
mae = []

fig, ax = plt.subplots(1,2, figsize=(16,8))
ax[0].plot(np.arange(-2.5,2.5,0.1),np.sin(np.arange(-2.5,2.5,0.1)),'r-',label='Theoretical curve')
for i, model in enumerate(models):
    model.fit(X_train.reshape(-1,1), y_train)
    y_pred = model.predict(X_test.reshape(-1,1))
    mae.append(mean_absolute_error(y_test, y_pred))
    if i == 0:
        ax[0].plot(X_test, y_pred, label=models_name[i], color=colors[i])
    else :
        ax[0].scatter(X_test, y_pred, label=models_name[i], color=colors[i], edgecolors="black", s=20)
    ax[0].legend()
    ax[0].set_xlabel("x")
    ax[0].set_ylabel("y")
    ax[0].set_title("Approximation  of the sinus function")
    

    ax[1].bar(x=np.arange(3)[i], height=mae[i], label=models_name[i])
    ax[1].set_title("Mean Absolute Error for each model")
    ax[1].set_ylabel("MAE")
    ax[1].legend()

Interprétation : La fonction sinus est une fonction non linéaire, ce qui explique que la méthode OLS donne une approximation grossière étant donné qu'elle essaye d'approximer linéairement. Contrairement à la méthode OLS, les arbres de décision ne supposent pas que les données suivent un modèle linéaire et peuvent ainsi permettrent d'approximer correctement des problèmes non linéaires. On constate également que les familles d'algorithmes par comités d'experts dont font partie les Random Forest permettent de réduire la variance des prédictions. Cela se constate sur le scatterplot où l'on voit bien que les points verts sont "moins dispersés" autour de la courbe. Cela permet d'obtenir une meilleure approximation comme le prouve la valeur de la Mean_Absolute_Error.

Documentation¶

Decision trees¶

http://scikit-learn.org/stable/modules/tree.html

Random forests¶

http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Plot decision surface : using plt.contourf¶

http://scikit-learn.org/stable/auto_examples/tree/plot_iris.html#sphx-glr-auto-examples-tree-plot-iris-py

Pruning trees: not available in scikit-learn.¶

Since post-pruning of tree is not implemented in scikit-learn, you may think of coding your own pruning function. For instance, taking into account the numer of samples per leaf as proposed below:

In [ ]:
# Pruning function (useful ?)
def prune(decisiontree, min_samples_leaf = 1):
    if decisiontree.min_samples_leaf <= min_samples_leaf:
        raise Exception('Tree already more pruned')
    else:
        decisiontree.min_samples_leaf = min_samples_leaf
        tree = decisiontree.tree_
        for i in range(tree.node_count):
            n_samples = tree.n_node_samples[i]
            if n_samples <= min_samples_leaf:
                tree.children_left[i]=-1
                tree.children_right[i]=-1
                

Reprenons l'exemple du dataset wine pour voir l'impact du post pruning sur un arbre. Le post pruning se fait sur un arbre seul pour éviter l'overfitting. Un modèle random forest utilisant du bootstrapping avec plusieurs arbres (faible correlation entre les arbres), il n'est pas nécessaire de faire du post-pruning.

L'arbre avec les meilleurs résultats était celui utilisant le criterion entropy, nous allons le réutiliser.

In [ ]:
wine= load_wine(as_frame=True)
data_full = wine.data.copy()
data_full['classes'] = wine.target.copy()
X = wine.data.copy()
y = wine.target.copy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.8, random_state=rng_seed)
In [ ]:
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=rng_seed)
In [ ]:
clf.fit(X_train, y_train)
Out[ ]:
DecisionTreeClassifier(criterion='entropy')
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
DecisionTreeClassifier(criterion='entropy')
In [ ]:
dot_data = tree.export_graphviz(clf, out_file=None, 
                         feature_names=wine.feature_names,  
                         class_names=wine.target_names,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = graphviz.Source(dot_data)  
graph 
Out[ ]:
Tree 0 od280/od315_of_diluted_wines ≤ 2.495 entropy = 1.58 samples = 35 value = [13, 11, 11] class = class_0 1 color_intensity ≤ 3.635 entropy = 0.414 samples = 12 value = [0, 1, 11] class = class_2 0->1 True 4 proline ≤ 774.0 entropy = 0.988 samples = 23 value = [13, 10, 0] class = class_0 0->4 False 2 entropy = 0.0 samples = 1 value = [0, 1, 0] class = class_1 1->2 3 entropy = 0.0 samples = 11 value = [0, 0, 11] class = class_2 1->3 5 entropy = 0.0 samples = 10 value = [0, 10, 0] class = class_1 4->5 6 entropy = 0.0 samples = 13 value = [13, 0, 0] class = class_0 4->6

Notre arbre n'a pas assez de noeuds et de feuilles pour utiliser du post-prune.